---
title: "R crash course exercise"
output: 
  html_notebook: 
    css: stylesheets/styles.css
---

# Part1

This exercise concerns the clinical descriptions of tumours from The Cancer Genome Atlas (TCGA). The data were previously downloaded from [GEO](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944) and have undergone some minor alterations; see the script [process_tcga_clinical.R](/process_tcga_clinical.R).

The data are provided as the file `tcga_clinical.tsv` in the `raw_data` directory of the `r_crash_course.zip` file.

<div class="exercise">
**Exercise**: What function from `readr` would you use to read the file `tcga_clinical.tsv` into R? Read the file in. How many rows and columns does it contain?

</div>

```{r}
library(readr)
data <- read_tsv("raw_data/tcga_clinical.tsv")
data
```


You should find that the data frame contains a great many columns; far too many to be useful. We would like to keep the columns containing the age of the patient and the tumour stage for our analysis. Rather than opening the file, or `View`ing it in RStudio, we can use a couple of helper functions to identify the relevant column names.

<div class="exercise">
**Exercise**: Use the `select` function in conjunction with `contains` and `starts_with` to identify columns that have Age or Stage information in their name. The code should look like the following (you will need to fill in the dots).

</div>

The functions `contains` and `starts_with` perform similar operations when used to select columns from a data frame. To use either of them, or the `select` function itself, we first have to load the `dplyr` library. The `contains` function identifies all columns that have a particular text pattern anywhere in their name. If we wanted all the columns with "age" in the name, the following would not be a good choice, as it would also pick up columns with "stage" in the name.

```{r}
library(dplyr)
select(data, contains("age"))
```
But if we wanted all the columns regarding "stage", `contains` would be a good choice.

```{r}
select(data, contains("stage"))
```
Since the age-related columns start with "age", we can use the `starts_with` function instead.

```{r}
select(data, starts_with("age"))
```
```{r}
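# Alternatively, keep contains("age") and explicitly drop the unwanted matches
# by prefixing each of them with a minus sign
select(data, contains("age"),
       -contains("stage"),
       -contains("agent"),
       -contains("heritage"),
       -contains("percentage"))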
```
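
The difference between the two helpers is easier to see on a small example. The sketch below uses a made-up data frame, `toy_cols`, whose column names are invented for illustration and are not the real TCGA column names.

```{r}
library(dplyr)

# Toy data frame with made-up column names (not the real TCGA columns)
toy_cols <- tibble(
  age_at_diagnosis = 61,
  tumor_stage      = "II",
  percentage_cells = 80,
  gender           = "FEMALE"
)

# contains() matches "age" anywhere in the name, so "stage" and
# "percentage" are picked up as well
names(select(toy_cols, contains("age")))

# starts_with() only matches names that begin with "age"
names(select(toy_cols, starts_with("age")))
```

Only the first column survives `starts_with("age")`, whereas `contains("age")` returns three of the four columns.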


<div class="exercise">
**Exercise:** Use the `select` function to create a new data frame that contains the following columns. **These are not the actual column names.**

  - Tumour site
  - Race
  - Gender
  - Age at diagnosis
  - Dead / Alive status

You can add extra columns if you wish.

**See below for example output**
</div>

```{r message=FALSE}
library(tidyverse)
clin <- readr::read_tsv("raw_data/tcga_clinical.tsv")
data <- select(clin, 
                tumor_tissue_site,
                race,
                gender,
                age_at_initial_pathologic_diagnosis,
                vital_status)
data
```



<div class="exercise">
**Exercise:** Use the `dplyr` function called `count` to tabulate which sites are included in the data. Re-arrange the output from `count` using `arrange` to determine the most common type of cancer in the dataset.

**See below for example output**
</div>


```{r }
count(data, tumor_tissue_site) %>% arrange(desc(n))

```
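
As a minimal sketch of what `count` followed by `arrange` produces, here is the same pattern on a small made-up table (`toy_sites` and its values are invented for illustration).

```{r}
library(dplyr)

# Made-up tissue sites, one row per patient
toy_sites <- tibble(site = c("Breast", "Lung", "Breast", "Kidney", "Breast", "Lung"))

# count() returns one row per distinct value, with the number of rows in a column called n;
# arrange(desc(n)) then puts the most frequent value first
site_counts <- count(toy_sites, site) %>% arrange(desc(n))
site_counts
```

The same ordering can also be requested directly with `count(toy_sites, site, sort = TRUE)`.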

<div class="exercise">
**Exercise**: Not all samples have an entry for tumour type. Use the `filter` function to create a table with valid entries for `tumor_tissue_site`. Create a barplot to display the number of occurrences of each tumour type.

HINT: An easy way to make the labels on the x-axis more legible is to use the `coord_flip` function.

```{r eval=FALSE}
ggplot(data, aes(x=...)) + geom_bar() + coord_flip()
```

**See below for example output**
</div>

```{r}
data %>%
  filter(!is.na(tumor_tissue_site)) %>%
  filter(tumor_tissue_site != "[Not Available]") %>%
  ggplot(aes(x = tumor_tissue_site)) +
  geom_bar() +
  coord_flip()
```
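
A side note on how `filter` treats missing values, sketched on a made-up table (`toy_na` is invented for illustration): `filter` keeps only the rows where the condition is `TRUE`, and comparing `NA` with a string gives `NA`, so rows with a missing site are dropped by the comparison as well. The explicit `!is.na()` step is therefore harmless, and arguably makes the intent clearer, but it is not strictly required.

```{r}
library(dplyr)

# Made-up sites including a true NA and the "[Not Available]" placeholder
toy_na <- tibble(site = c("Breast", NA, "[Not Available]", "Lung"))

# NA != "[Not Available]" evaluates to NA, and filter() drops rows whose condition is NA
toy_valid <- filter(toy_na, site != "[Not Available]")
toy_valid
```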




# Part2

We would like to visualise the age at diagnosis and eventually compare it between different disease types. The code we might initially think to use could look like this:

```{r}
## assuming your filtered clinical data is called data

ggplot(data, aes(x = age_at_initial_pathologic_diagnosis)) + geom_density()
```
This doesn't look like the desired output, though. If we revisit the data frame and print the "age" column, we notice that the entries in the column are stored as "chr", i.e. characters or text.



```{r}
select(data, age_at_initial_pathologic_diagnosis)
```

This has occurred because some entries are "`[Not Available]`" rather than a number or `NA`. As soon as R finds any text within the column, it treats everything in the column as text.
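
This behaviour can be reproduced without the full dataset. The sketch below reads a tiny made-up table held in memory (the patient labels and ages are invented, and `I()` is used so that `readr` treats the string as literal data rather than a file path).

```{r}
library(readr)

# Made-up two-column table; the second patient has a text placeholder instead of an age
toy_text <- I("patient\tage\nA\t61\nB\t[Not Available]\nC\t47\n")
toy <- read_tsv(toy_text, show_col_types = FALSE)

# The single text entry means the whole column is read as character
class(toy$age)

# as.numeric() converts the values it can and turns the placeholder into NA (with a warning)
suppressWarnings(as.numeric(toy$age))
```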

These entries can be filtered in the same manner as previously (when filtering the tissue type column), but this does not solve the problem entirely. 

```{r}
data %>%
  filter(age_at_initial_pathologic_diagnosis != "[Not Available]") %>%
  ggplot(aes(x = age_at_initial_pathologic_diagnosis)) +
  geom_density()
```

We need to add an additional step that forces R to treat the data in the `age_at_initial_pathologic_diagnosis` column as numerical data. Such a conversion can be done using the `as.numeric` function, and the `mutate` function can be used to overwrite the `age_at_initial_pathologic_diagnosis` column with the numeric values.

<div class="exercise">
**Exercise**: Use `mutate` and `as.numeric` to convert the values in `age_at_initial_pathologic_diagnosis` into numbers. You will still need to remove the `[Not Available]` values beforehand. Now try to create the density plot.
</div>

```{r}
data %>%
  filter(age_at_initial_pathologic_diagnosis != "[Not Available]") %>%
  mutate(age_at_initial_pathologic_diagnosis = as.numeric(age_at_initial_pathologic_diagnosis)) %>%
  ggplot(aes(x = age_at_initial_pathologic_diagnosis)) +
  geom_density()
```
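
The filter-then-convert pattern can be checked on a few made-up values before applying it to the real column. In this sketch `toy_ages` and its contents are invented for illustration.

```{r}
library(dplyr)

# Made-up ages stored as text, as they are after import
toy_ages <- tibble(age = c("61", "[Not Available]", "47"))

toy_clean <- toy_ages %>%
  filter(age != "[Not Available]") %>%   # drop the placeholder first
  mutate(age = as.numeric(age))          # then overwrite the column with numbers

toy_clean

# Numeric summaries now work
mean(toy_clean$age)
```

Filtering first means `as.numeric` never sees the placeholder, so no `NA` values or coercion warnings are produced.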
<div class="exercise">
**Exercise**: Use the `facet_wrap` function to compare the distribution of ages between different tumour types.
</div>

```{r }
data %>%
  filter(age_at_initial_pathologic_diagnosis != "[Not Available]") %>%
  filter(tumor_tissue_site != "[Not Available]") %>%
  mutate(age_at_initial_pathologic_diagnosis = as.numeric(age_at_initial_pathologic_diagnosis)) %>%
  ggplot(aes(x = age_at_initial_pathologic_diagnosis)) +
  geom_density() +
  facet_wrap(~tumor_tissue_site)
```

<div class="exercise">
**Exercise**: Do any tumour types have a different age of diagnosis between males and females? Use a boxplot to find out.
</div>

```{r}
data %>%
  filter(age_at_initial_pathologic_diagnosis != "[Not Available]") %>%
  filter(tumor_tissue_site != "[Not Available]") %>%
  mutate(age_at_initial_pathologic_diagnosis = as.numeric(age_at_initial_pathologic_diagnosis)) %>%
  ggplot(aes(x = gender, y = age_at_initial_pathologic_diagnosis)) +
  geom_boxplot() +
  facet_wrap(~tumor_tissue_site)
```

Let's now look at the gender split for each cancer type. As a first step, we can group the data by tissue type and gender and obtain counts.

```{r}
data %>% 
  group_by(tumor_tissue_site,gender) %>% 
  filter(tumor_tissue_site != "[Not Available]") %>% 
  summarise(N = n())
```

These data are ready for plotting, but for comparisons we need to take into account the total number of cases of each tissue type. We can create frequencies rather than absolute numbers by dividing each count by the total number of cases for that tissue type.

```{r}
data %>% 
  group_by(tumor_tissue_site,gender) %>% 
  filter(tumor_tissue_site != "[Not Available]") %>% 
  summarise(N = n()) %>% 
  mutate(freq = N / sum(N))
```

Note that the order of the grouping is important here. If we reversed it to `gender` then `tumor_tissue_site`, the frequencies would be calculated using the total number of males or of females.

```{r}
data %>% 
  group_by(gender,tumor_tissue_site) %>% 
  filter(tumor_tissue_site != "[Not Available]") %>% 
  summarise(N = n()) %>% 
  mutate(freq = N / sum(N))
```
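
The effect of the grouping order is easiest to verify by hand on a very small table. The sketch below uses six made-up cases (`toy_cases` is invented for illustration). `summarise` removes the last grouping variable, so the `sum(N)` inside `mutate` is computed within whichever variable was listed first. Setting `.groups = "drop_last"` states that default explicitly and silences the message.

```{r}
library(dplyr)

# Six made-up cases: four breast (3 female, 1 male) and two lung (1 female, 1 male)
toy_cases <- tibble(
  site   = c("Breast", "Breast", "Breast", "Breast", "Lung", "Lung"),
  gender = c("FEMALE", "FEMALE", "FEMALE", "MALE", "FEMALE", "MALE")
)

# Site first: the result stays grouped by site, so sum(N) is the total per site
by_site <- toy_cases %>%
  group_by(site, gender) %>%
  summarise(N = n(), .groups = "drop_last") %>%
  mutate(freq = N / sum(N))

# Gender first: the result stays grouped by gender, so sum(N) is the total per gender
by_gender <- toy_cases %>%
  group_by(gender, site) %>%
  summarise(N = n(), .groups = "drop_last") %>%
  mutate(freq = N / sum(N))

by_site
by_gender
```

For the female lung case, the first table gives 1/2 (one of the two lung cases) while the second gives 1/4 (one of the four female cases).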


<div class="exercise">
**Exercise**: Create a plot to show the gender split in cases of each tumour type.
</div>

```{r }
data %>% 
  group_by(tumor_tissue_site,gender) %>% 
  filter(tumor_tissue_site != "[Not Available]") %>% 
  summarise(N = n()) %>% 
  mutate(freq = N / sum(N)) %>% 
  ggplot(aes(x = gender, y = freq)) + geom_col() + facet_wrap(~tumor_tissue_site)
```

<div class="exercise">
**Exercise**: Create a plot to show the proportion of patients dead or alive for each tumour type.
</div>


```{r}
data %>% 
  group_by(tumor_tissue_site,vital_status) %>% 
  filter(tumor_tissue_site != "[Not Available]") %>% 
  filter(vital_status != "[Not Available]") %>%
  summarise(N = n()) %>% 
  mutate(freq = N / sum(N)) %>% 
  ggplot(aes(x = vital_status, y = freq)) + geom_col() + facet_wrap(~tumor_tissue_site)
```

